mbDenoise: microbiome data denoising using zero-inflated probabilistic principal components analysis

The analysis of microbiome data has several technical challenges. In particular, count matrices contain a large proportion of zeros, some of which are biological, whereas others are technical. Furthermore, the measurements suffer from unequal sequencing depth, overdispersion, and data redundancy. These nuisance factors introduce substantial noise. We propose an accurate and robust method, mbDenoise, for denoising microbiome data. Assuming a zero-inflated probabilistic PCA (ZIPPCA) model, mbDenoise uses variational approximation to learn the latent structure and recovers the true abundance levels using the posterior, borrowing information across samples and taxa. mbDenoise outperforms state-of-the-art methods to extract the signal for downstream analyses. Supplementary Information The online version contains supplementary material available at (10.1186/s13059-022-02657-3).


Simulation experiments with a covariate
Examples M1-M3 were extended to M7-M9 by setting for M7 and M8 and where v i is the covariate (e.g., healthy versus diseased) and γ j is the coefficient. In the same way as in M7 and M8, example M5 was extended to M10. We set

Simulation experiments for DA testing
Examples M7-M9 with varied γ j were used in DA analysis. Specifically, we set for M9, where s ∈ {0, 1, 2, 3, 4, 5} represents the effect size. The rest of the parameters were the same as those in Table S1.2.

Additional results for simulation experiments
We present results for settings with n > p, and those for settings with a covariate. Note that mbDenoise-zinb-cov, mbDenoise-zip-cov, and PPCA-NB-cov denote the covariate-adjusted version of mbDenoise-zinb, mbDenoise-zip, and PPCA-NB, respectively.

Simulation results for settings with n > p
Fig. S2.1 shows estimation and prediction errors for mbDenoise and other methods for settings with n > p. We see that the results were similar to those for settings with n < p in the main text. F1 and F2 denote the orthogonal projection distance and symmetric Procrustes error for predicting f i , B1 and B2 represent those for estimating β j , all averaged over 100 data replications, and time is the average computation time in seconds on the log base 10 scale. Absence of results fort-SNE in B1 and B2 was due to no estimation of β j in t-SNE.  Boxplots of error measures of estimation and prediction for mbDenoise and other methods for settings with a covariate and n > p. F1 and F2 denote the orthogonal projection distance and symmetric Procrustes error for predicting f i , B1 and B2 represent those for estimating β j , all averaged over 100 data replications, and time is the average computation time in seconds on the log base 10 scale. Absence of results for t-SNE in B1 and B2 was due to no estimation of β j in t-SNE. C1-C4 denote the Frobenius norm error, Kullback-Leibler divergence, Shannon's index mean squared error, and Simpson's index mean squared error, and time stands for computation time in seconds, all averaged over 100 data replications and on the log base 10 scale. Absence of results for SAVER_zr in M6 was due to an exception in SAVER.

Simulation results for settings with n > p
Fig. S2.7 shows errors of data recovery for mbDenoise and other methods for settings with n > p.
We can see that the results were similar to those for settings with n < p in the main text.

Simulation results for other DA testing methods
From Fig. S2.10, we can see that CSS and TMM performed poorly with very low recall, ZINB had very low precision in M7 and M8, and ANCOM had very low recall in M7-M9. Overall, the proposed method mbDenoise-zinb-cov outperformed ZINB, ANCOM, CSS and TMM.

Simulation results with data generated from SparseDOSSA 2
We also applied SparseDOSSA 2 to simulate data for DA analysis. From Table S2.1, we can see that the proposed method mbDenoise-zinb-cov performed well compared to the best that was done in each case. The Poisson and negative binomial distributions have been widely studied and used for modeling univariate count-valued data. Multivariate generalizations of them that permit rich dependencies, however, have been far less popular, and can be divided into two main classes: (i) where the marginal distributions are Poisson/negative binomial, and (ii) where the joint distribution is a mixture of independent Poisson/negative binomial distributions. Many of the recent developments in the first class use the general copula framework. The sparseDOSSA 2 method also belongs to this class, with some modifications. In particular, it specifies zero-inflated log-normal marginal distributions, invokes normalization to address compositionality, models feature-feature correlations through a multivariate Gaussian copula, and uses a penalized estimation procedure to address high dimensionality. The proposed method mbDenoise belongs to the second class by specifying zero-inflated negative binomial marginals and accounting for compositionality. The main difference between mbDenoise and sparseDOSSA 2 arises from the way that high dimensionality is addressed. Instead of assuming sparse feature-feature correlations as in sparseDOSSA 2, an alternative is to assume a low rank correlation structure as in mbDenoise. Mixture models are particularly helpful if there is overdispersion in the data, which is often the case for microbiome count data. The ZIPPCA model underlying mbDenoise also belongs to the class of generalized latent variable models, which are increasingly being used for dimension reduction and for studying the factors driving co-occurrence among taxa. Nevertheless, exact inference of the parameters in latent variable models is typically computationally difficult due to the presence of latent mixing variables.

3
Additional results for empirical data analysis

Negative control of stool microbiomes of one geographical location
We conducted a negative control analysis of Bhopal samples by randomly assigning a binary label (Bhopal/Kerala) to each sample. Ordination analysis and diversity estimation in Fig. S3.1a and S3.1b show that there was no significant difference in community composition between the two groups, for all methods except PPCA-NB. Fig. S3.1c shows that metagenomeSeq, t-test, SAVER, and mbDenoise-zinb-cov detected few or no species. By contrast, DESeq2, mbImpute, and egdeR all identified a few species, suggesting that they could not control the false discovery rate. , and denoising methods (PPCA-NB, mbDenoise-zinb and SAVER) by applying PCA and t-SNE to the denoised data. Inputs of PCA and t-SNE were log-transformed. Beta diversity was assessed using permutational multivariate analysis of variance (PERMANOVA). b Alpha diversity analysis. Included methods for composition estimation were zr, svt, pmr, dmm, empirical Bayes estimate by PPCA-NB and by mbDenoise-zinb, and zr using the denoised data from SAVER (SAVER_zr). Significance was calculated using the Wilcoxon test. c Visualization of sets of differentially abundant (DA) species between the two groups. Shown are sets detected by t-test, DEseq2, edgeR, metagenomeSeq, and t-test applied to imputed/denoised data (mbImpute, SAVER, and mbDenoise-zinb-cov). We used a FDR threshold of 0.05. Fig. S3.2a reveals significant community distinctions between CP and control groups, and between gingiva and tongue, but hardly any difference between subg and supra sites. We further carried out alpha diversity analysis using tongue samples from CP patients and healthy controls (Fig. S3.2b). Only mbDenoise-zinb revealed that patients with CP had significantly higher alpha diversity than those of control subjects, which is consistent with previous studies. .2: Analysis of oral microbiomes of chronic periodontitis. a Data ordination and beta diversity analysis between CP patients and healthy controls, and between gingival (subg and supra) sites and tongue. Included were algorithm-based (PCA and t-SNE), model-based (ZIFA, PPCA-NB, mbDenoise-zinb), and denoising methods (PPCA-NB, mbDenoise-zinb and SAVER) by applying PCA and t-SNE to the denoised data. Inputs of PCA and t-SNE were log-transformed. Beta diversity was assessed using PERMANOVA. b Alpha diversity analysis of tongue samples from CP patients and healthy controls. Included methods for composition estimation were zr, svt, pmr, dmm, empirical Bayes estimate by PPCA-NB and by mbDenoise-zinb, and zr using the denoised data from SAVER (SAVER_zr). Significance was calculated using the Wilcoxon test. Absence of results for dmm was due to an exception that disrupts the program execution.  Table 1 in the main text. There was evidence of community dissimilarity between CRC patients and healthy controls (Fig. S3.3a and S3.3b). Furthermore, there was no discernible difference in alpha diversity between the two groups, with the exception of mbDenoise-zinb in the fifth dataset (Fig. S3.4b).  Table 1 in the main text. Included were algorithm-based (PCA and t-SNE), model-based (ZIFA, PPCA-NB, mbDenoise-zinb), and denoising methods (PPCA-NB, mbDenoise-zinb and SAVER) by applying PCA and t-SNE to the denoised data. Inputs of PCA and t-SNE were log-transformed, and the empty was due to an exception in ZIFA. Beta diversity was assessed using PERMANOVA.  Table 1 in the main text. Included methods for composition estimation were zr, svt, pmr, dmm, empirical Bayes estimate by PPCA-NB and by mbDenoise-zinb, and zr using the denoised data from SAVER (SAVER_zr). Significance was calculated using the Wilcoxon test.

Negative control of stool microbiomes of healthy controls
Again, we conducted a negative control analysis of healthy samples by randomly assigning a binary label (CRC/control) to each sample. Fig. S3.5 shows that the proportions of species declared DA by mbDenoise-zinb-cov were less than those by DESeq2 and mbImpute.

Comparison of mbDenoise with other DA testing methods
We also compared our method with ZINB, ANCOM, CSS and TMM in real data analysis. Fig. S3.6 and S3.7 show that, ZINB was prone to more false discoveries and ANCOM was very conservative. CSS and TMM performed poorly with very low recall, and their performance was similar to that of ANCOM. These were consistent with the findings in the simulation (see Fig. S2.10). , and t-test applied to normalized data by CSS and TMM, and to denoised data by mbDenoise-zinb-cov. We used a FDR threshold of 0.05. Before carrying out DA analysis, each healthy sample in (b) was randomly assigned a binary label (Bhopal/Kerala). .7: Visualization of sets of differentially abundant (DA) species a between CP and control groups, and b between subg and supra sites. Shown are sets detected by ANCOM, ZINB, and t-test applied to normalized data by CSS and TMM, and to denoised data by mbDenoise-zinb-cov. We used a FDR threshold of 0.05.

The proposed algorithm
By replacing the intractable posterior distribution p(f i , z i | x i ) of the latent variables (f i , z i ) with a variational famliy of distributions q(f i , z i ) , our VA algorithm can be viewed as a relaxation to EM: estimating the variational parameters amounts to an approximate E-step, while estimating the model parameters corresponds to an approximate M-step. Thus, we maximize the ELBO by alternating between updating model parameters with variational parameters fixed, and updating variational parameters with model parameters fixed, as outlined in Algorithm 1.